285 research outputs found

    Fast Algorithm for Partial Covers in Words

    Get PDF
    A factor $u$ of a word $w$ is a cover of $w$ if every position in $w$ lies within some occurrence of $u$ in $w$. A word $w$ covered by $u$ thus generalizes the idea of a repetition, that is, a word composed of exact concatenations of $u$. In this article we introduce a new notion of $\alpha$-partial cover, which can be viewed as a relaxed variant of cover, that is, a factor covering at least $\alpha$ positions in $w$. We develop a data structure of $O(n)$ size (where $n=|w|$) that can be constructed in $O(n\log n)$ time, which we apply to compute all shortest $\alpha$-partial covers for a given $\alpha$. We also employ it for an $O(n\log n)$-time algorithm computing a shortest $\alpha$-partial cover for each $\alpha=1,2,\ldots,n$.
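
    For intuition, here is a minimal brute-force sketch of the $\alpha$-partial cover notion (not the $O(n\log n)$ data structure from the article): it counts how many positions of $w$ lie inside some occurrence of a candidate factor $u$. The function names are illustrative only.

        def covered_positions(w: str, u: str) -> int:
            """Count the positions of w that lie inside at least one occurrence of u."""
            n, m = len(w), len(u)
            covered = [False] * n
            for i in range(n - m + 1):
                if w[i:i + m] == u:            # occurrence of u starting at position i
                    for j in range(i, i + m):  # mark every position it spans
                        covered[j] = True
            return sum(covered)

        def is_alpha_partial_cover(w: str, u: str, alpha: int) -> bool:
            """u is an alpha-partial cover of w if it covers at least alpha positions."""
            return covered_positions(w, u) >= alpha

        # "aba" covers 6 of the 9 positions of "abaabbaba", so it is a
        # 6-partial cover but not a (full) cover.
        print(covered_positions("abaabbaba", "aba"))  # 6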

    Space-efficient detection of unusual words

    Full text link
    Detecting all the strings that occur in a text more frequently or less frequently than expected according to an IID or a Markov model is a basic problem in string mining, yet current algorithms are based on data structures that are either space-inefficient or incur large slowdowns, and current implementations cannot scale to genomes or metagenomes in practice. In this paper we engineer an algorithm based on the suffix tree of a string to use just a small data structure built on the Burrows-Wheeler transform, and a stack of $O(\sigma^2\log^2 n)$ bits, where $n$ is the length of the string and $\sigma$ is the size of the alphabet. The size of the stack is $o(n)$ except for very large values of $\sigma$. We further improve the algorithm by removing its time dependency on $\sigma$, by reporting only a subset of the maximal repeats and of the minimal rare words of the string, and by detecting and scoring candidate under-represented strings that do not occur in the string. Our algorithms are practical and work directly on the BWT, thus they can be immediately applied to a number of existing datasets that are available in this form, returning this string mining problem to a manageable scale. Comment: arXiv admin note: text overlap with arXiv:1502.0637
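
    As a toy illustration of the scoring idea (not the BWT-based engineering described in the paper), the sketch below estimates how unusual a word is under an IID letter model: the expected occurrence count is compared to the observed count and normalized by a crude binomial-style standard deviation. The helper name and the exact normalization are assumptions for illustration.

        from collections import Counter
        from math import sqrt

        def iid_zscore(text: str, word: str) -> float:
            """Score how unusual `word` is in `text` under an IID letter model.
            Positive = over-represented, negative = under-represented."""
            n, m = len(text), len(word)
            freq = Counter(text)
            probs = {c: freq[c] / n for c in freq}  # empirical letter probabilities
            p = 1.0
            for c in word:
                p *= probs.get(c, 0.0)              # P(word) under the IID model
            positions = n - m + 1
            expected = positions * p
            observed = sum(text.startswith(word, i) for i in range(positions))
            # Crude normalization; occurrences are not truly independent.
            std = sqrt(positions * p * (1 - p)) or 1.0
            return (observed - expected) / std

        print(round(iid_zscore("ACGTACGTACGTACGT", "ACGT"), 2))  # strongly over-represented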

    The Longest Common Subsequence Problem Revisited

    Get PDF

    A framework for space-efficient string kernels

    Full text link
    String kernels are typically used to compare genome-scale sequences whose length makes alignment impractical, yet their computation is based on data structures that are either space-inefficient or incur large slowdowns. We show that a number of exact string kernels, like the $k$-mer kernel, the substrings kernels, a number of length-weighted kernels, the minimal absent words kernel, and kernels with Markovian corrections, can all be computed in $O(nd)$ time and in $o(n)$ bits of space in addition to the input, using just a $\mathtt{rangeDistinct}$ data structure on the Burrows-Wheeler transform of the input strings, which takes $O(d)$ time per element in its output. The same bounds hold for a number of measures of compositional complexity based on multiple values of $k$, like the $k$-mer profile and the $k$-th order empirical entropy, and for calibrating the value of $k$ using the data.
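
    For concreteness, a naive version of one of the kernels mentioned above, the $k$-mer kernel, is sketched below as the inner product of $k$-mer count vectors. It uses hash maps and linear extra space, whereas the point of the paper is to achieve $o(n)$ extra bits via the Burrows-Wheeler transform; treat this as a definitional sketch only.

        from collections import Counter

        def kmer_counts(s: str, k: int) -> Counter:
            """Count vector of all length-k substrings of s."""
            return Counter(s[i:i + k] for i in range(len(s) - k + 1))

        def kmer_kernel(s: str, t: str, k: int) -> int:
            """k-mer kernel: inner product of the k-mer count vectors of s and t."""
            cs, ct = kmer_counts(s, k), kmer_counts(t, k)
            return sum(cs[x] * ct[x] for x in cs if x in ct)

        print(kmer_kernel("ACGTACGT", "CGTACGTT", 3))  # 8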

    Off-line compression by greedy textual substitution

    Full text link

    An output-sensitive algorithm for the minimization of 2-dimensional String Covers

    Full text link
    String covers are a powerful tool for analyzing the quasi-periodicity of 1-dimensional data and find applications in automata theory, computational biology, coding and the analysis of transactional data. A cover of a string $T$ is a string $C$ for which every letter of $T$ lies within some occurrence of $C$. String covers have been generalized in many ways, leading to $k$-covers, $\lambda$-covers, approximate covers, and were studied in different contexts such as indeterminate strings. In this paper we generalize string covers to the context of 2-dimensional data, such as images. We show how they can be used for the extraction of textures from images and identification of primitive cells in lattice data. This has interesting applications in image compression, procedural terrain generation and crystallography.
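
    A brute-force sketch of the underlying 2D cover test (not the output-sensitive minimization algorithm of the paper): a candidate h-by-w block $C$ covers a 2D array $T$ if every cell of $T$ lies inside some occurrence of $C$. The names are illustrative.

        def covers_2d(T: list, C: list) -> bool:
            """Check whether block C is a 2D cover of array T (both given as lists of equal-length row strings)."""
            R, cols = len(T), len(T[0])
            h, w = len(C), len(C[0])
            covered = [[False] * cols for _ in range(R)]
            for i in range(R - h + 1):
                for j in range(cols - w + 1):
                    if all(T[i + a][j:j + w] == C[a] for a in range(h)):  # occurrence at (i, j)
                        for a in range(h):
                            for b in range(w):
                                covered[i + a][j + b] = True
            return all(all(row) for row in covered)

        # A 4x4 texture covered by a 2x2 primitive cell.
        T = ["abab",
             "cdcd",
             "abab",
             "cdcd"]
        print(covers_2d(T, ["ab", "cd"]))  # True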

    Fast linear-space computations of longest common subsequences

    Get PDF
    Space-saving techniques in computations of a longest common subsequence (LCS) of two strings are crucial in many applications, notably in molecular sequence comparisons. For about ten years, however, the only linear-space LCS algorithm known required time quadratic in the length of the input, for all inputs. This paper reviews linear-space LCS computations in connection with two classical paradigms originally designed to take less than quadratic time in favorable circumstances. The objective is to achieve the space reduction without alteration of the asymptotic time complexity of the original algorithm. The first of the resulting constructions takes time $O(n(m-l))$ and is thus suitable for cases where the LCS is expected to be close to the shortest input string. The second takes time $O(ml \log(\min[s, m, 2n/l]))$ and suits cases where one of the inputs is much shorter than the other. Here $m$ and $n$ ($m \leq n$) are the lengths of the two input strings, $l$ is the length of a longest common subsequence and $s$ is the size of the alphabet. Along the way, a very simple $O(m(m-l))$-time algorithm is also derived for the case of strings of equal length.
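
    As a baseline for the space issue discussed above, the LCS length (though not an LCS itself) is already computable in linear space by keeping only two rows of the classical dynamic-programming table; the quadratic-time sketch below shows that standard recurrence and is not one of the faster constructions of the paper.

        def lcs_length(x: str, y: str) -> int:
            """Length of a longest common subsequence of x and y, using O(min(m, n)) space."""
            if len(x) < len(y):
                x, y = y, x                  # iterate rows over the shorter string
            prev = [0] * (len(y) + 1)
            for a in x:
                curr = [0]
                for j, b in enumerate(y, 1):
                    curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
                prev = curr
            return prev[-1]

        print(lcs_length("ABCBDAB", "BDCABA"))  # 4, e.g. "BCBA"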

    On Computing Longest Common Subsequences in Linear Space

    Get PDF

    Covering Problems for Partial Words and for Indeterminate Strings

    Full text link
    We consider the problem of computing a shortest solid cover of an indeterminate string. An indeterminate string may contain non-solid symbols, each of which specifies a subset of the alphabet that could be present at the corresponding position. We also consider covering partial words, which are a special case of indeterminate strings where each non-solid symbol is a don't care symbol. We prove that the indeterminate string covering problem and the partial word covering problem are NP-complete for a binary alphabet and show that both problems are fixed-parameter tractable with respect to $k$, the number of non-solid symbols. For the indeterminate string covering problem we obtain a $2^{O(k \log k)} + n k^{O(1)}$-time algorithm. For the partial word covering problem we obtain a $2^{O(\sqrt{k}\log k)} + n k^{O(1)}$-time algorithm. We prove that, unless the Exponential Time Hypothesis is false, no $2^{o(\sqrt{k})} n^{O(1)}$-time solution exists for either problem, which shows that our algorithm for this case is close to optimal. We also present an algorithm for both problems which is feasible in practice. Comment: full version (simplified and corrected); preliminary version appeared at ISAAC 2014; 14 pages, 4 figures
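
    To make the covering notion concrete, a naive check for the partial word case (each non-solid symbol is a don't care, written '*' here) is sketched below; the paper's results concern the much harder problem of finding a shortest solid cover, not this test. The names and the '*' convention are assumptions.

        def matches_at(word: str, cover: str, i: int) -> bool:
            """Does the solid string `cover` match the partial word at position i ('*' = don't care)?"""
            return all(word[i + k] in ("*", cover[k]) for k in range(len(cover)))

        def is_solid_cover(word: str, cover: str) -> bool:
            """Check whether `cover` covers every position of the partial word `word`."""
            n, m = len(word), len(cover)
            covered = [False] * n
            for i in range(n - m + 1):
                if matches_at(word, cover, i):
                    for j in range(i, i + m):
                        covered[j] = True
            return all(covered)

        # Occurrences of "aba" at positions 0, 3 and 5 cover all 8 positions.
        print(is_solid_cover("ab*ab*ba", "aba"))  # True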

    Mining, compressing and classifying with extensible motifs

    Get PDF
    BACKGROUND: Motif patterns of maximal saturation emerged originally in contexts of pattern discovery in biomolecular sequences and have recently proven a valuable notion also in the design of data compression schemes. Informally, a motif is a string of intermittently solid and wild characters that recurs more or less frequently in an input sequence or family of sequences. Motif discovery techniques and tools tend to be computationally imposing; however, special classes of "rigid" motifs have been identified for which discovery is affordable in low polynomial time. RESULTS: In the present work, "extensible" motifs are considered, such that each sequence of gaps comes endowed with some elasticity, whereby the same pattern may be stretched to fit segments of the source that match all the solid characters but are otherwise of different lengths. A few applications of this notion are then described. In applications of data compression by textual substitution, extensible motifs are seen to bring savings on the size of the codebook, and hence to improve compression. In germane contexts, in which compressibility is used in its dual role as a basis for structural inference and classification, extensible motifs are seen to support unsupervised classification and phylogeny reconstruction. CONCLUSION: Off-line compression based on extensible motifs can be used advantageously to compress and classify biological sequences.
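
    A small sketch of what matching an "extensible" motif means: solid characters separated by gaps whose lengths may stretch within given bounds. The representation used here (a list of characters and (min, max) gap ranges) is an illustrative assumption, not the encoding used by the tool described in the paper.

        def motif_occurs_at(seq: str, motif: list, start: int) -> bool:
            """motif = [c0, (lo1, hi1), c1, ...]: solid characters separated by extensible
            gaps; return True if some stretching of the gaps matches at `start`."""
            def rec(pos: int, k: int) -> bool:
                if k == len(motif):
                    return True
                item = motif[k]
                if isinstance(item, tuple):            # extensible gap (lo, hi)
                    lo, hi = item
                    return any(rec(pos + g, k + 1) for g in range(lo, hi + 1))
                return pos < len(seq) and seq[pos] == item and rec(pos + 1, k + 1)
            return rec(start, 0)

        # 'A', then 1-3 arbitrary characters, then 'G', then exactly one character, then 'T'.
        motif = ["A", (1, 3), "G", (1, 1), "T"]
        seq = "AXXGCTAAGT"
        print([i for i in range(len(seq)) if motif_occurs_at(seq, motif, i)])  # [0]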